So far, we have discussed estimating average treatment effects (ATE) for a study population.
But it’s very unlikely that a treatment will have the same effect for every single person.
In subgroup analysis, we divide up our sample (e.g., using blocking) and essentially conduct a miniature experiment within each subgroup.
We can then estimate conditional average treatment effects (CATEs) within each subgroup.
Using interaction terms, we can also test whether treatment effects differ significantly between subgroups.
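As a sketch of the interaction-term approach (the data, effect sizes, and variable names below are invented for illustration): regress the outcome on treatment, the subgroup indicator, and their product. The interaction coefficient then estimates the difference in CATEs between the subgroups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated experiment: treatment effect of 2 in one subgroup, 5 in the other.
# All numbers here are illustrative, not from the text.
n = 1000
group = rng.integers(0, 2, n)         # subgroup indicator
treat = rng.integers(0, 2, n)         # randomized treatment
y = 2 * treat + 3 * treat * group + rng.normal(0, 1, n)

# Regress y on treatment, subgroup, and their interaction.
X = np.column_stack([np.ones(n), treat, group, treat * group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta[1] estimates the CATE in the baseline subgroup (~2);
# beta[3] estimates the difference in CATEs between subgroups (~3).
print(beta)
```

In practice you would also compute a standard error for the interaction coefficient; the point of the sketch is only that the interaction term is what carries the between-subgroup difference.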
Why might we want to look for treatment heterogeneity?
In a classic study from the 1940s, American psychologists presented a group of children with a black doll and a white doll, and asked them which doll they would like to play with.
Suppose that out of a group of 20 children, 16 selected the white doll. How would you interpret this result?
Suppose that the researchers also coded the race of the children. They found that 15 out of 15 white kids selected the white doll, while 4 out of 5 black kids selected the black doll. Now how would you interpret these results?
Suppose instead that 12 out of 15 white kids selected the white doll, and 4 out of 5 black kids also selected the white doll. How does your interpretation change?
**Example: Treatment Heterogeneity with Randomization Inference**

| Subject | Achievement | Treatment Status | Y(0) | Y(1) |
|---|---|---|---|---|
| A | low | 0 | 1 | ? |
| B | low | 0 | 2 | ? |
| C | low | 1 | ? | 5 |
| D | low | 1 | ? | 5 |
| E | high | 0 | 4 | ? |
| F | high | 0 | 5 | ? |
| G | high | 1 | ? | 6 |
| H | high | 1 | ? | 6 |
Download the data.
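Treating the 0/1 column as treatment status and the final number for each subject as the observed potential outcome, a sketch of the within-block CATEs and a randomization-inference test under the sharp null of no effect:

```python
from itertools import combinations

# Observed data from the table above: within each achievement block,
# two units were treated and two were not.
blocks = {
    "low":  {"treated": [5, 5], "control": [1, 2]},
    "high": {"treated": [6, 6], "control": [4, 5]},
}

# Estimated CATE within each block, and the pooled ATE.
cates = {b: sum(v["treated"]) / 2 - sum(v["control"]) / 2
         for b, v in blocks.items()}
ate = sum(cates.values()) / 2     # blocks are equal-sized, so a simple average
print(cates, ate)                 # {'low': 3.5, 'high': 1.5} 2.5

# Randomization inference under the sharp null of no effect for anyone:
# the observed outcomes are fixed, and we re-randomize 2-of-4 treated
# within each block (6 x 6 = 36 possible assignments).
low = [5, 5, 1, 2]
high = [6, 6, 4, 5]
sim_ates = []
for t_low in combinations(range(4), 2):
    for t_high in combinations(range(4), 2):
        t = [low[i] for i in t_low] + [high[i] for i in t_high]
        c = [low[i] for i in range(4) if i not in t_low] + \
            [high[i] for i in range(4) if i not in t_high]
        sim_ates.append(sum(t) / 4 - sum(c) / 4)

# Two-sided p-value: share of assignments as extreme as the observed ATE.
p = sum(abs(a) >= abs(ate) for a in sim_ates) / len(sim_ates)
print(p)  # 2/36, about 0.056
```

The low-achievement CATE (3.5) is more than twice the high-achievement CATE (1.5), which is the kind of heterogeneity this example is meant to surface.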
Questions:
Suppose that a researcher compares CATEs across two subgroups, men and women. Among men (N = 100), the CATE is estimated to be 8.0 with a standard error of 3.0, which is significant at p < 0.05. Among women (N = 25), the CATE is estimated to be 7.0 with an estimated standard error of 6.0, which is not significant, even at the 10% significance level.
When writing up these results, the researcher argues: “the treatment only works for men; for women, the effect is statistically indistinguishable from zero.”
What do you think about this claim?
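One way to evaluate the claim is to test the difference between the two CATEs directly. Assuming the two estimates are independent, the standard error of the difference is the square root of the sum of the squared standard errors:

```python
import math

cate_men, se_men = 8.0, 3.0
cate_women, se_women = 7.0, 6.0

diff = cate_men - cate_women                  # 1.0
se_diff = math.sqrt(se_men**2 + se_women**2)  # about 6.71
z = diff / se_diff                            # about 0.15
print(diff, se_diff, z)
```

The 1.0-point gap between the subgroups is tiny relative to its standard error: a significant estimate in one group alongside an insignificant one in the other does not mean the two effects differ.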
If you estimate treatment effects amongst a large number of subgroups, odds are good that you will find a statistically significant CATE purely by chance.
For example, suppose a researcher assesses whether each of 20 covariates interacts with the treatment. For the sake of illustration, suppose the covariates are uncorrelated with one another and that the treatment has an effect of zero for all subjects.
The probability of finding at least one covariate that significantly interacts with the treatment at the 0.05 significance level is 0.642.
Perhaps more striking is the fact that the probability of finding at least one covariate that significantly interacts with the treatment at the 0.01 significance level is 0.182.
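These figures follow directly from the multiplication rule for independent tests; a quick check:

```python
# With 20 independent tests and a true effect of zero everywhere, the chance
# of at least one false positive is 1 - (1 - alpha)^20.
p_05 = 1 - (1 - 0.05) ** 20
p_01 = 1 - (1 - 0.01) ** 20
print(round(p_05, 3), round(p_01, 3))  # 0.642 0.182
```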
Here’s another example using simulated data.
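A minimal simulation along these lines (the sample size, seed, and test procedure below are my own choices, not from the text): each of 20 independent covariates is tested for an interaction with a treatment that truly does nothing.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sims = 200, 20, 1000
hits = 0

for _ in range(sims):
    treat = rng.integers(0, 2, n)
    y = rng.normal(0, 1, n)              # the treatment truly has zero effect
    covs = rng.normal(0, 1, (n, k))      # 20 mutually independent covariates
    significant = False
    for j in range(k):
        # OLS of y on treatment, covariate j, and their interaction.
        X = np.column_stack([np.ones(n), treat, covs[:, j], treat * covs[:, j]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 4)
        se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[3, 3])
        if abs(beta[3] / se) > 1.96:     # interaction "significant" at 0.05
            significant = True
            break
    hits += significant

rate = hits / sims
print(rate)  # lands near the analytical 0.64
```

Despite a treatment effect of exactly zero for everyone, roughly two-thirds of simulated researchers who fish through all 20 covariates find a "significant" interaction.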
One (conservative) solution is to use a Bonferroni correction: if you conduct h hypothesis tests, divide your significance level by h. For example, if you conduct 20 subgroup analyses at the 0.05 significance level, only reject the null hypothesis if the p-value for any particular comparison is below 0.05 / 20 = 0.0025.
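A minimal sketch of the correction (the helper name and p-values here are illustrative):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject only those hypotheses whose p-value is below alpha / h."""
    h = len(p_values)
    return [p < alpha / h for p in p_values]

# 20 subgroup analyses: only p-values below 0.05 / 20 = 0.0025 count.
rejections = bonferroni_reject([0.001, 0.004, 0.03] + [0.5] * 17)
print(rejections)  # only the first test survives the correction
```

Note that p = 0.004 and p = 0.03 would both look "significant" under the naive 0.05 threshold, but neither survives the correction.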
If you’ve worked with data long enough, you have had this experience:
Don’t do this!
Instead, you can “tie your own hands” by specifying in advance (using a PAP) which subgroup analyses you plan to conduct.
Anything which is not pre-specified should be flagged as “exploratory analysis.” It’s fine to explore treatment heterogeneity. But we should also regard such hypothesis tests with skepticism pending replication by another study.
Or, if you have a large enough sample, just split your dataset in two. Use half to search for heterogeneity, and the other half to conduct statistical tests of CATEs. Essentially, this is what more sophisticated machine learning methods do.
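A sketch of this split-sample workflow on simulated data (all variable names, effect sizes, and the seed are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: a binary candidate moderator, with a larger true CATE in group 1.
n = 2000
treat = rng.integers(0, 2, n)
group = rng.integers(0, 2, n)
y = treat * (1 + 2 * group) + rng.normal(0, 1, n)

# Split once, at random, into an exploration half and a confirmation half.
idx = rng.permutation(n)
explore, confirm = idx[: n // 2], idx[n // 2:]

def cate(sample, g):
    """Difference in treated/control means among subgroup g within `sample`."""
    s = sample[group[sample] == g]
    return y[s][treat[s] == 1].mean() - y[s][treat[s] == 0].mean()

# Use the exploration half to pick the subgroup with the larger apparent CATE...
best = max([0, 1], key=lambda g: cate(explore, g))
# ...then estimate (and test) that CATE on data the search never touched.
print(best, cate(confirm, best))
```

Because the confirmation half played no role in the search, its estimate and any test computed on it are free of the fishing problem described above.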
Noticing that treatment effects tend to be large in some groups and absent from others can provide important clues about why treatments work. But this doesn’t mean that, by changing the moderating variable, we can also manipulate the treatment effect.
For example, suppose that in the cell phone experiment, we were somehow able to measure participants’ SES (e.g. by looking at how they were dressed). Say we find that discrimination is strongest amongst people who looked poor. Does this mean that if we started handing out money to all Swiss citizens, we could reduce discrimination?
Bottom line: subgroup analysis is fundamentally non-experimental in character (akin to the use of covariates) and must be interpreted as such. Moderating variables can be used to predict differences in the CATE, but they do not cause the CATE to vary.
Please take a few minutes to fill out an (anonymous) midterm evaluation survey.